Data visualization in practice

Case study: new insights on poverty

Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in peoples’ minds into an over-dramatic worldview and strong negative stress feelings: “The world is getting worse!”, “It’s we vs. them!”, “Other people are strange!”, “The population just keeps growing!” and “Nobody cares!”

library(tidyverse)
library(dslabs)
library(ggrepel)
data(gapminder)
gapminder |> as_tibble()
## # A tibble: 10,545 × 9
##    country   year infant_mortality life_expectancy fertility population      gdp
##    <fct>    <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
##  1 Albania   1960            115.             62.9      6.19    1636054 NA      
##  2 Algeria   1960            148.             47.5      7.65   11124892  1.38e10
##  3 Angola    1960            208              36.0      7.32    5270844 NA      
##  4 Antigua…  1960             NA              63.0      4.43      54681 NA      
##  5 Argenti…  1960             59.9            65.4      3.11   20619075  1.08e11
##  6 Armenia   1960             NA              66.9      4.55    1867396 NA      
##  7 Aruba     1960             NA              65.7      4.82      54208 NA      
##  8 Austral…  1960             20.3            70.9      3.45   10292328  9.67e10
##  9 Austria   1960             37.3            68.8      2.7     7065525  5.24e10
## 10 Azerbai…  1960             NA              61.3      5.57    3897889 NA      
## # ℹ 10,535 more rows
## # ℹ 2 more variables: continent <fct>, region <fct>

Scatterplots

filter(gapminder, year == 1962) |>
  ggplot(aes(fertility, life_expectancy)) +
  geom_point()

Show continent with color

filter(gapminder, year == 1962) |>
  ggplot( aes(fertility, life_expectancy, color = continent)) +
  geom_point() 

Faceting

Show a scatterplot for two years and each continent separately.

filter(gapminder, year%in%c(1962, 2012)) |>
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_grid(year~continent)

Combine the continents and use color to denote them

filter(gapminder, year%in%c(1962, 2012)) |>
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_grid(. ~ year)

facet_wrap

Let’s show same plot for 5 years

years <- c(1962, 1980, 1990, 2000, 2012)
continents <- c("Europe", "Asia")
gapminder |> 
  filter(year %in% years & continent %in% continents) |>
  ggplot( aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_wrap(~year) 

Fixed scales for better comparisons

ggplot defaults to using the same y-axis and x-axis range

Here is how to change that:

filter(gapminder, year%in%c(1962, 2012)) |>
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_wrap(. ~ year, scales = "free")

But it makes comparisons difficult.

Time series plots

Time series plot:

gapminder |> 
  filter(country == "United States") |> 
  ggplot(aes(year, fertility)) +
  geom_point()

here it is with lines:

gapminder |> 
  filter(country == "United States") |> 
  ggplot(aes(year, fertility)) +
  geom_line()

Using lines is particularly helpful with comparisons.

Let’s try it:

countries <- c("South Korea","Germany")

gapminder |> filter(country %in% countries) |> 
  ggplot(aes(year,fertility)) +
  geom_line()

What happened?

Need to let ggplot know there are two lines:

countries <- c("South Korea","Germany")

gapminder |> filter(country %in% countries & !is.na(fertility)) |> 
  ggplot(aes(year, fertility, group = country)) +
  geom_line()

Use color to distinguish them:

countries <- c("South Korea","Germany")

gapminder |> filter(country %in% countries & !is.na(fertility)) |> 
  ggplot(aes(year,fertility, col = country)) +
  geom_line()

Labels instead of legends

For trend plots we recommend labeling the lines rather than using legends.

The package geomtextpath

library(geomtextpath)

gapminder |> 
  filter(country %in% countries) |> 
  ggplot(aes(year, life_expectancy, col = country, label = country)) +
  geom_textpath() +
  theme(legend.position = "none")

Data transformations

covert gdp to dollars per day

gapminder <- gapminder |>  mutate(dollars_per_day = gdp/population/365)

Log transformation

Here is a histogram of per day incomes from 1970:

past_year <- 1970
gapminder |> 
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(dollars_per_day)) + 
  geom_histogram(binwidth = 1, color = "black")

We use the color = "black" argument to draw a boundary and clearly distinguish the bins.

Here is the distribution if we apply a log base 2 transform:

gapminder |> 
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(log2(dollars_per_day))) + 
  geom_histogram(binwidth = 1, color = "black")

In a way this provides a close-up of the mid to lower income countries.

Which base?

  • log base 10 usueful for changes in order of magnitude

  • log base 2 useful in most other circumstances

  • log natural base rarely if ever useful for dataviz

For population log 10 probably best:

filter(gapminder, year == past_year) |>
  summarize(min = min(population), max = max(population))
##     min       max
## 1 46075 808510713

Here is the histogram of the transformed values:

gapminder |> 
  filter(year == past_year) |>
  ggplot(aes(log10(population))) +
  geom_histogram(binwidth = 0.5, color = "black")

In the above, we quickly see that country populations range between ten thousand and ten billion.

Transform the values or the scale?

Transforming the values make plot more readable. mid values easier to interpret.

----1----x----2--------3----

for log transformed data, we know that the value of x is about 1.5. If the scales are logged:

This saves you the conversion in your head, but figuring out x is harder.

----10---x---100------1000---

gapminder |> 
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(dollars_per_day)) + 
  geom_histogram(binwidth = 1, color = "black") +
  scale_x_continuous(trans = "log2")

Visualizing multimodal distributions

Comparing multiple distributions with boxplots and ridge plots

gapminder |> 
  filter(year == past_year & !is.na(gdp)) |>
  mutate(region = reorder(region, dollars_per_day, FUN = median)) |>
  ggplot(aes(dollars_per_day, region)) +
  geom_point() +
  scale_x_continuous(trans = "log2")  

gapminder <- gapminder |> 
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe","Southern Europe", 
                    "Northern America", 
                  "Australia and New Zealand") ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", 
                  "South America") ~ "Latin America",
    continent == "Africa" & 
      region != "Northern Africa" ~ "Sub-Saharan",
    TRUE ~ "Others"))

We turn this group variable into a factor to control the order of the levels:

gapminder <- gapminder |> 
  mutate(group = factor(group, levels = c("Others", "Latin America", 
                                          "East Asia", "Sub-Saharan",
                                          "West")))

In the next section we demonstrate how to visualize and compare distributions across groups.

Boxplots

p <- gapminder |> 
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(group, dollars_per_day)) +
  geom_boxplot() +
  scale_y_continuous(trans = "log2") +
  xlab("") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
p

Boxplots have the limitation that by summarizing the data into five numbers, we might miss important characteristics of the data. One way to avoid this is by showing the data.

p + geom_point(alpha = 0.5)

Ridge plots

Boxplots are very efficient in summarizing distributions but can hide bimodalities:

If not too many categories, we can plot density plots instead of boxes.

library(ggridges)
p <- gapminder |> 
  filter(year == past_year & !is.na(dollars_per_day)) |>
  ggplot(aes(dollars_per_day, group)) + 
  scale_x_continuous(trans = "log2") 
p  + geom_density_ridges() 

We can add points to show data:

p + geom_density_ridges(jittered_points = TRUE)

or tick marks:

p + geom_density_ridges(jittered_points = TRUE, 
                        position = position_points_jitter(height = 0),
                        point_shape = '|', point_size = 3, 
                        point_alpha = 1, alpha = 0.7)

Example: 1970 versus 2010 income distributions

past_year <- 1970
present_year <- 2010
years <- c(past_year, present_year)
gapminder |> 
  filter(year %in% years & !is.na(gdp)) |>
  mutate(west = ifelse(group == "West", "West", "Developing")) |>
  ggplot(aes(dollars_per_day)) +
  geom_histogram(binwidth = 1, color = "black") +
  scale_x_continuous(trans = "log2") + 
  facet_grid(year ~ west)

The ecological fallacy and importance of showing the data

The relationship between these two variables is almost perfectly linear and the graph shows a dramatic difference. While in the West less than 0.5% of infants die, in Sub-Saharan Africa the rate is higher than 6%!

Note that the plot uses a new transformation, the logistic transformation.

Logistic transformation

\[f(p) = \log \left( \frac{p}{1-p} \right)\]

Show the data

## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Encoding third variables

Case study: vaccines and infectious diseases

library(tidyverse)
library(RColorBrewer)
library(dslabs)
data(us_contagious_diseases)
names(us_contagious_diseases)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"
the_disease <- "Measles"
dat <- us_contagious_diseases |>
  filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) |>
  mutate(rate = count / population * 100000 * 52 / weeks_reporting) |> 
  mutate(state = reorder(state, rate)) 

We can now easily plot disease rates per year. Here are the measles data from California:

dat |> filter(state == "California" & !is.na(rate)) |>
  ggplot(aes(year, rate)) +
  geom_line() + 
  ylab("Cases per 100,000")  + 
  geom_vline(xintercept=1963, col = "blue")

dat |> ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() +  
  theme(panel.grid = element_blank(), 
        legend.position="bottom", 
        text = element_text(size = 8)) +
  ggtitle(the_disease) + 
  ylab("") + xlab("")